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Abstract 



We study the problem of estimating from data, a sparse approximation to the inverse covariance matrix. Estimating 
a sparsity constrained inverse covariance matrix is a key component in Gaussian graphical model learning, but one 
that is numerically very challenging. We address this challenge by developing a new adaptive gradient-based method 
that carefully combines gradient information with an adaptive step-scaling strategy, which results in a scalable, highly 
competitive method. Our algorithm, like its predecessors, maximizes an i?i-norm penalized log-likelihood and has the 
same per iteration arithmetic complexity as the best methods in its class. Our experiments reveal that our approach 
outperforms state-of-the-art competitors, often significantly so, for large problems. 

1 Introduction 

A widely employed multivariate analytic tool is a Gaussian Markov Random Field (GMRF), which in essence, simply 
defines Gaussian distributions over an undkected graph. GMRFs are enormously useful; familiar applications include 
speech recognition ID, time-series analysis, semiparametric regression, image analysis, spatial statistics, and graphical 
model learning fT7ll29ll3Tl. 

In the context of graphical model learning, a key objective is to learn the structure of a GMRF 1291 |3T| . Learning 
this structure often boils down to discovering conditional independencies between the variables, which in turn is 
equivalent to locating zeros in the associated inverse covariance (precision) matrix — ^because in a GMRF, an edge 
is present between nodes i and j, if and only if ^ lfT7]|29l . Thus, estimating an inverse covariance matrix with 
zeros is a natural task, though formulating it involves two obvious, but conflicting goals: (i) parsimony and (ii) fidelity. 

It is easy to acknowledge that fidelity is a natural goal; in the GMRF setting it essentially means obtaining a statis- 
tically consistent estimate (e.g. under mild regularity conditions a maximum-likelihood estimate |22|). But parsimony 
too is a valuable target. Indeed, it has long been recognized that sparse models are more robust and can even lead to 
better generalization performance |9|. Furthermore, sparsity has great practical value, because it: (i) reduces storage 
and subsequent computational costs (potentially critical to settings such as speech recognition |4|); (ii) leads to GM- 
RFs with fewer edges, which can speed up inference algorithms EOllSTl ; (iii) reveals interplay between variables and 
thereby aids structure discovery, e.g., in gene analysis lllOl . 

*This work was done when SS was affiliated with the Max Planck Institute for Biological Cybernetics. We finished this preprint on June 3, 2010, 
but have uploaded it after more than a year to arXiv (in June 20 11), so it is not the definitive version of this work. 
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Achieving parsimonious estimates while simuhaneously respecting fidelity is the central aim of the Sparse In- 
verse Covariance Estimation (SICE) problem. Recently, d'Aspremont et al. |i8J formulated SICE by considering the 
penaUzed Gaussian log-likelihood IS) (omitting constants) 

logdet(X)-Tr(5X)-py \x,,l 

where S is the sample covariance matrix and p > is a parameter that trades-off between fidelity and sparsity. Maxi- 
mizing this penalized log-likelihood is a non-smooth concave optimization task that can be numerically challenging for 
high-dimensional S (largely due to the positive-definiteness constraint on X). We address this challenge by develop- 
ing a new adaptive gradient-based method that carefully combines gradient information with an adaptive step-scaling 
strategy. And as we will later see, this care turns out to have a tremendous positive impact on empirical performance. 
However, before describing the technical details, we put our work in broader perspective by first enlisting the main 
contributions of this paper, and then summarizing related work. 

1.1 Contributions 

This paper makes the following main contributions: 

1. It develops and efficient new gradient-based algorithm for SICE. 

2. It presents associated theoretical analysis, formalizing key properties such as convergence. 

3. It illustrates empirically not only our own algorithm, but also competing approaches. 

Our new algorithm has 0(r?^ arithmetic complexity per iteration like other competing approaches. However, our 
algorithm has an edge over many other approaches, because of three reasons: (i) the constant within the big-Oh is 
smaller, as we perform only one gradient evaluation per iteration; (ii) the eigenvalue computation burden is reduced, 
as we need to only occasionally compute positive-definiteness; and (ii) the empirical convergence turns out to be rapid. 
We substantiate the latter observation through experiments, and note that state-of-the-art methods, including the recent 
sophisticated methods of Lu IJSi il9J . or the simple but effective method of U IJ , are both soundly outperformed by 
our method. 

Notation: We denote by 5" and 5", the set of n x n symmetric, and semidefinite ()^ 0) matrices, respectively. The 
matrix S G 5" denotes the sample covariance matrix on n dimensional data points. We use \X\ to denote the matrix 
with entries \xij\. Other symbols will be defined when used. 

1.2 Related work, algorithmic approaches 

Dempster |9| was the first to formally investigate SICE, and he proposed an approach wherein certain entries of the 
inverse covariance matrix are explicitly set to zero, while the others are estimated from data. Another old approach 
is based on greedy forward-backward search, which is impractical as it requires 0{r?) MLEs ifTTl l29l . More re- 
cently, Meinshausen and Biihlmann [211 presented a method that solves n different Lasso subproblems by regressing 
variables against each other: this approach is fast, but approximate, and it does not yield the maximum-likelihood 
estimator |8 , 13 1. 

An £i -penalized log-likelihood maximization formulation of SICE was considered by |8| who introduced a nice 
block-coordinate descent scheme (with 0(n'*) per-iteration cost); their scheme formed the basis of the graphical lasso 
algorithm 1 13 1, which is much faster for very sparse graphs, but unfortunately has unknown theoretical complexity and 
its convergence is unclear llT8l[T9l . There exist numerous other approaches to SICE, e.g., the Cholesky-decomposition 
(of the inverse covariance) based algorithm of ll28l . which is an expensive 0{n^) per-iteration method. There are 
other Cholesky-based variants, but these are either heuristic 1 15], or limit attention to matrices with a known banded- 
structure |3|. Many generic approaches are also possible, e.g., interior-point methods I.24J (usually too expensive for 
SICE), gradient-projection EUSIHIIIISIEtI, or projected quasi-Newton [6. 301. 
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Due to lack of space we cannot afford to give the various methods a survey treatment, and restrict our attention to 
the recent algorithms of [18. .19J and of [1 U . as these seem to be over a broad range of inputs the most competitive 
methods. Lu ifTSl optimizes the dual (|7| by applying Nesterov's techniques 123)1 . obtaining an 0(1/ -/e) iteration com- 
plexity algorithm (for e-accuracy solutions), that has 0{n^) cost per iteration. He applies his framework to solve ([T]) 
(with R = pee^), though to improve the empirical performance he derives a variant with slightly weaker theoretical 
guarantees. In ||T9l . the author derives two "adaptive" methods that proceed by solving a sequence of SICE problems 
with varying R. His first method is based on an adaptive spectral projected gradient |5|, while second is an adap- 
tive version of his Nesterov style method from 1J_8|. Duchi et al. mj presented a simple, but surprisingly effective 
gradient-projection algorithm that uses backtracking line-search initiaUzed with a "good" guess for the step-size (ob- 



tained using second-order information). We remark, however, that even though Duchi et al. s algorithm [ 1 1 1 usually 
works well in practice, its convergence analysis has a subtle problem: the method ensures descent, but not sufficient 
descent, whereby the algorithm can "converge" to a non-optimal point. Our express goal is the paper is to derive an 
algorithm that not only works better empirically, but is also guaranteed to converge to the true optimum. 

1.3 Background: Problem formulation 

We begin by formally defining the SICE problem, essentially using a formulation derived from lISl lTTIl : 

min F{X) = Tv{SX) - logdct(X) + Ti{R\X\). (1) 

Here R E 5" is an elementwise non-negative matrix of penalty parameters; if its ij-entry Vij is large, it will shrink | Xij \ 
towards zero, thereby gaining sparsity. Following fT9l we impose the restriction that 5+Diag(i?) >- 0, where Diag(i?) 
is the diagonal (matrix) of R. This restriction ensures that ([T]) has a solution, and subsumes related assumptions made 
in |l8l[TT]. Moreover, by the strict convexity of the objective, this solution is unique. 

Instead of directly solving ([T]), we also (like ||8][TT][T8l|T9l) focus on the dual as it is differentiable, and has merely 
bound constraints. A simple device to derive dual problems to £i -constrained problems is lfT6ll : introduce a new 
variable Z = X and a corresponding dual variable U. Then, the Lagrangian is 

L{X, Z, U) = Tt{SZ) - logdet(Z) + + Tr(C/(Z - X)). (2) 

The dual function g{U) = mix,z L{X, Z, U). Computing this infimum is easy as it separates into infima over X and 
Z. We have 

inf Tr(S'Z) - logdet(Z) + Tr(C/Z), (3) 



which yields 
Next consider 



S-Z-^ + U^Q, or Z={S + U)-'^. (4) 



inf ||i?0X||i4 - Tr(C/X) = inf pjj|a;jj| - UyXij, (5) 

which is unbounded unless \uij \ < pij, in which case it equals zero. This, with Q yields the dual function 

g{U) = 1^'^^^ + ^^^^ ' logdct(Z) \U,,\ < 

I —CO otherwise. 

Hence, the dual optimization problem may be written as (dropping constants) 

min - logdet(S' + ?7) 
V (7) 
s.t. \uij\ < p^j, i,j e [n]. 
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Observe that — = (S + U) ^ — X; so we easily obtain the primal optimal solution from the dual-optimal. 
Defining, Xjj — {S + U)^^, the duality gap may be computed as 

J^iXu) - g{U) = 1t{SXu) + Tr(i?|Xc/|) - n. (8) 

Given this problem formalization we are now ready to describe our algorithm. 

2 Algorithm 

Broadly viewed, our new adaptive gradient-based algorithm consists of three main components: (i) gradient compu- 
tation; (ii) step-size computation; and (iii) "checkpointing" for positive-definiteness. Computation of the gradient is 
the main 0{ti?) cost that we (like other gradient-based methods), must pay at every iteration. Even though we com- 
pute gradients only once per iteration, saving on gradient computation is not enough. We also need a better strategy 
for selecting the step-size and for handling the all challenging positive-definiteness constraint. Unfortunately, both 
are easier said than done: a poor choice of step-size can lead to painfully slow convergence ID, while enforcing 
positive-definiteness requires potentially expensive eigenvalue computations. The key aspect of our method is a new 
step-size selection strategy, which helps greatly accelerate empirical convergence. We reduce the unavoidable cost of 
eigenvalue computation by not performing it at every iteration, but only at certain "checkpoint" iterations. Intuitively, 
at a checkpoint the iterate must be tested for positive-definiteness, and failing satisfaction an alternate strategy (to be 
described) must be invoked. We observed that in practice, most intermediate iterates generated by our method usually 
satisfy positive-definiteness, so the alternate strategy is triggered only rarely. Thus, the simple idea of checkpointing 
has a tremendous impact on performance, as borne through by our experimental results. Let us now formalize these 
ideas below. 

We begin by describing our stepsize computation, which is derived from the well-known Barzilai-Borwein (BB) 
stepsize |1| that was found to accelerate {unconstrained) steepest descent considerably |1, 7/251. Unfortunately 
the BB stepsizes do not work out-of-the-box for constrained problems, i.e., naiively plugging them into gradient- 
projection does not work Q. For constrained setups the spectral projected gradient (SPG) algorithm IS) is a popular 
method based on BB stepsizes combined with a globalization strategy such as non-monotone line-searches fTTl. For 
SICE, very recently Lu [19] exploited SPG to develop a sophisticated, adaptive SPG algorithm for SICE. For deriving 
our BB-style stepsizes, we first recall that for unconstrained convex quadratic problems, steepest-descent with BB is 
guaranteed to converge lfT2]| . This immediately suggests using an active-set idea, which says that if we could identify 
variables active at the solution, we could reduce the optimization to an unconstrained problem, which might be "easier" 
to solve than its constrained counterpart. We thus arrive at our first main idea: use the active-set information to modify 
the computation of the BB step itself. This simple idea turns out to have a big impact on empirical consequences; we 
describe it below. 

We partition the variables into active and working sets, and carry out the optimization over the working set alone. 
Moreover, since gradient information is also available, we exploit it to further refine the active-set to obtain the binding 
set; both are formalized below. 

Definition 1. Given dual variable U, the active set A{U) and the binding set B{U) are defined as 

A{U)^{{t,j)\ \U,,\^p,,},, (9) 
BiU) = I = -p,j, d,jg{U) > 0, or U,j = p^, d,,g{U) < 0}, (10) 

where dijQ{U) denotes the (i, j) component of VQ{U). The role of the binding set is simple: variables bound 
at the current iteration are guaranteed to be active at the next. Specifically, let Z// — { C/ | | Uij \ = pij } , and denote 
orthogonal projection onto U by Pu (•), i.e., 

Pu (Uij) = mid{C/„, -pij, py }. (11) 
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If e B{U^) and we iterate U^+'^ = Pu {W" - j''Wg{U'')) for some 7''' > 0, then since 

the membership E A{U''^^) holds. Therefore, if we know that G B{U^), we may discard Uf^ from 

the update. We now to employ this idea in conjunction with the BB step, and introduce our new modification: first 
compute B{U'^) and then confine the computation of the BB step to the subspace defined by (i, j) ^ B{U'^). The 
last ingredient that we need is safe-guard parameters amin, Cmax, which ensure that the BB-step computation is well- 
behaved lfT4l[T9ll26ll . We thus obtain our modified computation, which we call modified-BB (MBB) step: 

Definition 2 (Modified BB step). 

u = — 1 — ^7— r (a)- or a = — (h), 

EiEj^u^'-^G^f' W^S'-'Wl (12) 

where AU'^^^ and A^'^^^ are defined by over only non-bound variables by the following formulas: 



'■J I 0, otherwise. 



0, otherwise. 



Assume for the moment that we did not have the positive-definite constraint. Even then, our MBB steps de- 
rived above, are not enough to ensure convergence. This is not surprising since even for the unconstrained gen- 
eral convex objective case, ensuring convergence of BB-step-based methods, globalization (line-search) strategies are 
needed lfT4l l26l : the problem is only worse for constrained problems too, and some line-search is needed to ensure 
convergence ||5][T2][14|. The easy solution for us too would be invoke a non-monotone line-search strategy. However, 
we observed that MBB step alone often produces converging iterates, and even lazy line-search in the style of Il7l [l4ll 
affects it adversely. Here we introduce our second main idea: To retain empirical benefits of subspace steps while 
still guaranteeing convergence, we propose to scale the MBB stepsize using a diminishing scalar sequence, but not 
out-of-the-box; instead we propose to relax the application of diminishing scalars by introducing an "optimistic" di- 
minishment strategy. Specifically, we scale the MBB step ( [T2j i with some constant scalar S for a fixed number, say M, 
of iterations. Then, we check whether a descent condition is satisfied. If the current iterate passes the descent test, then 
the method continues for another M iterations with the same 6; if it fails to satisfy the descent, then we diminish the 
scaling factor 5. This diminishment is "optimistic" because even when the method fails to satisfy the descent condition 
that triggered diminishment, we merely diminish S once, and continue using it for the next M iterations. We remark 
that superficially this strategy might seem similar to occasional line-search, but it is fundamentally different: unlike 
hne-search our strategy does not enforce monotonicity after failing to descend for a prescribed number of iterations. 
We formalize this below. 

Suppose that the method is at iteration c, and from there, iterates with a constant (5^ for the next M iterations, so 
that from the current iterate If^, we compute 

U^+^=Pu{U''-S''-a^Wg{U'')), (15) 

for /c = c, c + 1, • • • , c + M — 1, where cr''' is computed via ([T2|i-(a) and ([T2]i-(b) alternatingly. Now, at the iterate 
^c+M ^ we check the descent condition: 



5 



Definition 3 (Descent condition). 

i 3 

for some fixed parameter k € (0, 1). 

If iterate U'^^^^ passes the test ( [T6] l, then we reuse 5'^ and set 5'^'^^ = 5'^; otherwise we diminish 5'^ and set 
^c+M j_ . ^Qj. gQjjjg fixed rj e (0; 1). After adjusting 5*=+*^ the method repeats the update ( fTSj ) for another M 
iterations. Finally, we measure the duality gap every M iterations to determine termination of the method. Explicitly, 
given a stopping threshold e > 0, we define: 

Definition 4 (Stopping criterion). 

T{^vg{u)) - g{u) = F{Xu) - g{u) = i:v{sxij) + Tr(i? \Xu\) - « < e. (i?) 

Algorithm[T]encapsulates all the above details, and presents pseudo-code of our algorithm SICE. 



Given 1/° and U^; 


for i 


= 1, • • • until the stopping criterion met do 




jjO ^ jji-l jjl ^_ jji. 




for j = 1, • • • ,M do /* Modified BB */ 






Compute B{U^) then compute using ([T2]l-(a) and -(b) alternatingly; 






c/j+i ^ Pu (u^ - s' ■ a^vg{w)y. 




iilsS + fj )~ Otlien 






if ty^' satisfies ([T6| tlien 






fj'+i J7^^ and (5*+i ^ (5^ 






else /* Diminish Optimistically */ 






[_ 6'+^ ^ T]5\ where rj e (0, 1); 




else 






Enforce posdef; 












if C/*^ satisfies ^ then 






C/'+i ^ J7^,and(5*+i 4- (5^ 






else /* Diminish Optimistically */ 






1^ (5^+^ ^ ri5\ where ry e (0, 1); 



Algorithm 1: Sparse Inverse Covariance Estimation (SICE). 



3 Analysis 

In this section we analyze theoretical properties of SICE. First, we establish convergence, and then briefly discuss 
some other properties. 

For clarity, we introduce some additional notation. Let M be the fixed number of modified-BB iterations, i.e., 
descent condition ([T6| is checked only every M iterations. We index these i\/-th iterates with /C = {1, M + 1, 2M + 
1, 3M + 1, • • • }, and then consider the sequence {J/'^}, r e /C generated by SICE. Let U* denote the optimal solution 
to the problem. We first show that such an optimal exists. 
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Theorem 5 (Optimal). If R > 0, then there exists an optimal U* to (|7| such that 

X*{S + U*)=I, and Tt{X*U*) ^Tt{R\X*\). 

Proof. From lfT9l Proposition 2.2] (or the supplementary material), We know that Problem ([T]) has a unique optimal 
solution X* G On the other hand, from 1 11 , Lemma 1], there exists a feasible point U E mt{U), thereby Slater's 
constraint qualification guarantees that the theorem holds. □ 

We remind the reader that when proving convergence of iterative optimization routines, one often assumes Lips- 
chitz continuity of the objective. We, therefore begin our discussion by stating the fact that the dual objective function 
G{U) is indeed Lipschitz continuous. 

Proposition 6 (Lipschitz constant M18I ). The dual gradient VfJ(C/) is Lipschitz continuous on lA with Lipschitz con- 
stant L = \^^^{X*){ma3iij TijY. 

We now prove that {V^} — )■ t/* as r — >■ cxo. To that end, suppose that the diminishment step (5'=+^^ ^ r; • ^'^ is 
triggered only a finite number of times. Then, there exists a sufficiently large K such that for all r G /C, r > K, 



r+1-] 



In such as case, we may view the sequence {[/'"} as if it were generated by an ordinary gradient projection scheme, 
whereby the convergence } — > U* follows using standard arguments |2|. Therefore, to prove the convergence of 
the entire algorithm, it suffices to discuss the case where the diminishment occurs infinitely often. Corresponding to 
an infinite sequence {C/'"}, there is a sequence {(5' }, which by construction is diminishing. Hence, if we impose the 
following unbounded sum condition on this sequence, following |,2J, we can again ensure convergence. 

Definition 7 (Diminishing with unbounded sum). 

(5* = OO. 

Since in our algorithm, we scale the MBB step ( [T2] i ct'' by a diminishing scalar 5^ , we must ensure that the resulting 
sequence 5^ a'' also satisfies the above condition. This is easy, but for the reader's convenience formalized below. 

Proposition 8. In Algorithm^ the stepsize 5^ ■ a"^ satisfies the condition^ 

Proof. Since limr^oo 5^ — Q and limr^oo Yl\=i = oo by construction, 

r r 

lim 5^ ■ a"^ < cr,„ax ' lim 5^ ^ Q and lim 6^ ■ cr* > cr,„in • lim \^ 6^ = oo. □ 

r— >-oo ) — ^oo r—^oo ^ — ^ r— voo ^ — ^ 

i=l 1=1 

Using this proposition we can finally state the main convergence theorem. Our proof is based on that of gradient- 
descent; we adapt it for our problem by showing some additional properties of {U^}. 

Theorem 9 (Convergence). The sequence of iterates {X^ ^ U*') generated by Algorithm 1 converges to the optimum 
solution {X* , U*). 

Proof. Consider the update ([TSj of Algorithm [T| we can rewrite it as 

L/'^+i = C/'^ + (5'' • (7'■D^ (18) 
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where the descent direction satisfies 



'0, if(i,j) €S(t/n, 



-midj^l^-^, d.^giU^Y otherwise. 



Using ([T9]l and the fact that there exists at least one \dijQ{U'^)\ > Ounless U"^ = t/*, we can conclude that there exists 
a constant ci > such that 

-EE^^^^(^')-^^. ^ E ^ c,\\\/g{un\\i > 0, (20) 

i J (i..MB(U-) 

where m,, =mid{'^J=^, "^i^, a,,5(C/'-)}. 
Similarly, we can also show that there exists C2 > such that 

W\\l<c^\\^Q{u"-)\\l. (21) 

Finally, using the inequalities pO] ) and ( pT) , in conjunction with Proposition [6j [8] the proof is immediate from Propo- 
sition 1.2.4 in L2J. □ 



4 Numerical Results 

We present numerical results on both synthetic as well as real-world data. We begin with the synthetic setup, which 
helps to position our method via— vis the other competing approaches. We compare against algorithms that represent 
the state-of-the-art in solving the SICE problem. Specifically, we compare the following 5 methods: (i) PG: the pro- 
jected gradient algorithm of lfTn |^(ii) ASPG: the very recent adaptive spectral projected gradient algorithm 1 19 |0:iii)ANES: 
an adaptive, Nesterov style algorithm from the ASPG paper |19|;^ (iv) SGOV: a sophisticated smooth optimization 
scheme built on Nesterov's techniques 1 18 1 J^and (v) SICE: our proposed algorithm. 

We note at this point that although due to space limits our experimentation is not exhaustive, it is fairly extensive. 
Indeed, we also tried several other baseline methods such as: gradient projection with Armijo-search rule Q, limited 
memory projected quasi-Newton approaches L6ii30J, among others. But all these approaches performed across wide 
range of problems similar to or worse than PG. Hence, we do not report numerical results against them. We further 
note that Lu |18| reports the SCOV method vastly outperforms block-coordinate descent (BCD) based approaches 
such as 18J, or even the graphical lasso FORTRAN code of Friedman et al. [13J . Hence, we also omit BCD methods 
from our comparisons. 

4.1 Synthetic Data Experiments 

We generated random instances in the same manner as d'Aspremont et al. lIS); the same data generation scheme was 
also used by Lu ITSl [T9l , and to be fair to them, we in fact used their MATLABcode to generate our data (this code 
is included in the supplementary material for the interested reader). First one samples a sparse, invertible matrix 
S' e 5", with ones on the diagonal, and having a desired density 5 obtained by randomly adding +1 and —1 off- 
diagonal entries. Then one perturbs S' to obtain S" — [S')^^ + tN, where N € 5" is an i.i.d. uniformly random 
matrix. Finally, one obtains S, the random sample covariance matrix by shifting the spectrum of 5" by 

5 = 5"-min{A„i„(5")-^,0}/, 

where > is a small scalar. The specific parameter values used were the same as in ifTSl . namely: S = .01, r = .15, 
and q = lO^"*. We generated matrices of size 100m x 100m, for me [1 ... 10] U {16, 20}. 

'impl. 

"Downloaded from 
^Downloaded from 
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e 




PG 


ASPG 


ANES 


SCOV 


SICE 


e 




PG 


ASPG 


ANES 


SCOV 


SICE 


lE-01 




8.46 


38.97 


48.05 


21.75 


2.28 


lE-01 




128.61 


1005.98 


1818.00 


1858.70 


66.08 


lE-02 




34.89 


72.25 


100.61 


46.10 


4.74 


lE-02 




358.90 


1712.01 


3173.55 


3216.55 


102.48 


lE-03 




127.37 


142.47 


214.56 


98.73 


7.61 


lE-03 




833.86 


3529.29 


5411.49 


5085.51 


170.25 


lE-04 




807.16 


519.37 


482.97 


224.58 


17.30 


lE-04 




1405.56 


5008.81 


9528.91 


9046.04 


234.18 


2E-05 






1914.33 


862.86 


409.06 


42.38 


2E-05 






6420.18 


15964.26 


15428.23 


307.67 


lE-05 






2435.14 


1 149.04 


548.34 


52.29 


lE-05 






7562.52 


19812.06 


19621.27 


336.25 



Table 1; Time (sees) for reaching desired duality gap. Left, S 6 >S^J_'"^, and right, S £ cS^,""". Since all algorithms decrease the duality-gap 
non-monotonically, we show the time-points when a prescribed e is first attained. 



Table [T] shows two sample results (more extensive results are available in the supplementary material). All the 
algorithms were run with a convergence tolerance of e = 2 • 10~^; this tolerance is very tight, as previously reported 
results usually used e = .1 or .01. Table [T] reveals two main results: (i) the advanced algorithms of flSllTSl are out- 
performed by both PG and SICE; and (ii) SICE vastly outperforms all other methods. The first result can be attributed 
to the algorithmic simplicity of PG and SICE, because ASPG, ANES, and SCOV employ a complex, more expensive 
algorithmic structure: they repeatedly use eigendecompositions. The second result, that is, the huge difference in 
performance between PG and SICE (see Table[r| second sub-table, fourth and fifth rows) is also easily explained. Al- 
though the step-size heuristic used by PG works very-well for obtaining low accuracy solutions, for slightly stringent 
convergence criteria its progress becomes much too slow. SICE on the other hand, exploits the gradient information 
to rapidly identify the active-set, and thereby exhibits faster convergence. In fact, SICE turns out to be even faster for 
low accuracy (e = .1) solutions. 

The convergence behavior of the various methods is illustrated in more detail in Figure [T] The left panel plots the 
duality gap attained as a function of time, which makes the performance differences between the methods starker: SICE 
converges rapidly to a tight tolerance in a fraction of the time needed by other methods. The right panel summarizes 
performance profiles as the input matrix size n x n is varied. Even as a function of n, SICE is seen to be overall much 
faster (y-axis is on a log scale) than other methods. 





Running time (seconds) 



Duality gap: s=1.0E-02 




200 400 600 800 1000 
Size ot matrix (n x n) 



200 400 600 800 1000 
Size of matrix (n x n) 



Figure 1 : Left panel: duality gap attained as a function of running time (in seconds); Right panel: Time taken to reach a prescribed 
duality gap as a function of matrix S"s size. 



5 Conclusions and future work 

We developed, analyzed, and experimented with a powerful new algorithm for estimating the sparse inverse covariance 
matrices. Our method was shown to outperform all the competing state-of-the-art approaches. We attribute the speedup 
gains to our careful algorithmic design, which also shows that even though the spectral projected gradient method is 
fast and often successful, invoking it out-of-the-box can be a suboptimal approach in scenarios where the constraints 
were simple. In such as setup, our modified spectral approach turns out to afford significant empirical gains. 
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At this point several avenues of future work are open. We list them in order of difficulty: (i) Implementing our 
method in a multi-core or GPU setting; (ii) Extending our approach to handle more general constraints; (iii) Deriving 
an even more efficient algorithm that only occasionally computes gradients; (iv) Deriving a method for SICE that does 
not need to expUcitly compute matrix inverses. 
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